{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In \"Leveraging Researcher Domain Expertise_Replication Code_A\", we laid out the procedure for the simulation of our method on the New York Times Front Page dataset, as this best-emulates the use-case for political science research.\n",
    "\n",
    "Here, we lay out two additional simulation schemes also noted within our manuscript that were run on the 20 NewsGroups dataset.\n",
    "\n",
    "Specifically, we noted the effectiveness of our method in two class distribution schemes - on the original, raw dataset and on a dataset where we skew the balance between categories - in each run choosing a single category to undersample, and three other categories to remain much larger (at their original size). \n",
    "\n",
    "Here, we lay out the steps for the second distribution scheme (as the first one simply runs on the existing class scheme and distribution).\n",
    "\n",
    "May 2022"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the text embeddings:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use here the same SentenceTransformers model as those used in the previous script, however, the 20 NewsGroups dataset consists of short text segments. Thus, when we converted each one of the segments to an embedding, we kept them as is, without converting them to the document level. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "import numpy as np\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### DIRECTORY WITH THE DATASET:\n",
    "## change to the directory location on user computer\n",
    "\n",
    "DIR = r\"E:\\Simulation_Replication Materials\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Load the embeddings for the corpus\n",
    "embds = os.path.join(DIR, \"20newsgroups_forSimulation_Embeddings.pickle\")\n",
    "\n",
    "pickle_in = open(embds,\"rb\")\n",
    "embeddings = pickle.load(pickle_in)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Load the CSV with the metadata for each embedding - text, category, etc.\n",
    "\n",
    "df = pd.read_csv(os.path.join(DIR, '20newsgroups_forSimulation_Metadata.csv'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(len(embeddings))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Running the Simulations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Add the document embeddings to the dataframe\n",
    "\n",
    "df['X'] = [np.array(x) for x in embeddings]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### list of categories:\n",
    "\n",
    "CATEGORIES = list(set(df['Category']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "np.random.seed(17)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## function to randomly choose categories for classification:\n",
    "\n",
    "def choose_random_categories(list_of_categories, number_of_cats):\n",
    "    return(random.sample(list_of_categories, number_of_cats))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Due to their already being an imbalance, \n",
    "### we prepare two lists of the categories to choose from:\n",
    "\n",
    "## smaller categories\n",
    "potential_rare = ['sport','automobile','religion','medicine',\n",
    "                 'sales','alt.atheism']\n",
    "\n",
    "## larger categories\n",
    "potential_freq = ['computer','science','politics']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### undersample small categories, maintain consistent large samples for bigger categories:\n",
    "def filter_categories_for_pool(rare_cats, freq_cats, df):\n",
    "    total_df = pd.DataFrame()\n",
    "    \n",
    "    for cat_ in rare_cats:\n",
    "        temp = df[df['Category'] == cat_]\n",
    "        temp = temp.sample(n=400)\n",
    "        total_df = pd.concat([total_df, temp])\n",
    "        \n",
    "    for cat_ in freq_cats:\n",
    "        temp = df[df['Category'] == cat_]\n",
    "        temp = temp.sample(n=2600)\n",
    "        total_df = pd.concat([total_df, temp])\n",
    "    \n",
    "    return total_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Simulation parameters:\n",
    "\n",
    "## How many times to run the simulation:\n",
    "RUNS = 300\n",
    "\n",
    "## How many samples to extract at each sampling iteration:\n",
    "STEPSIZE = 18\n",
    "\n",
    "## Number of iterations for each simulation run \n",
    "## (The more iterations, the more sampling rounds executed)\n",
    "NUM_STEPS = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Imports for simulation functions:\n",
    "\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.model_selection import train_test_split\n",
    "from scipy import stats\n",
    "from scipy.stats import entropy\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "## TQDM Notebook - not required, but allows the user to track simulation progress:\n",
    "from tqdm import tqdm_notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# trains an SVM on X and y and returns the classifier\n",
    "def train_svm(X, y, proba=False): # add hyperparameter tuning\n",
    "    clf = SVC(probability=proba) # ASK GUY IF NEED TRUE\n",
    "    clf.fit(X, y)\n",
    "    return clf\n",
    "\n",
    "# note - for avg_calculation can do either \"macro\", \"micro\" or \"weighted\"\n",
    "# note - removed accuracy from the function\n",
    "\n",
    "def eval_svm(clf, testX, testY, avg_calculation):\n",
    "    predY = clf.predict(testX)\n",
    "    f1 = f1_score(testY, predY, average=avg_calculation)\n",
    "    return f1\n",
    "\n",
    "\n",
    "def calculate_centroids(df, categories):\n",
    "    centroids_ = []\n",
    "    \n",
    "    for c in categories:\n",
    "        temp = df[df['y'] == c]\n",
    "        vecs = []\n",
    "        for i,r in temp.iterrows():\n",
    "            vecs.append(r.X)\n",
    "        meanvec = np.mean(vecs, axis=0)\n",
    "        centroids_.append(meanvec)\n",
    "        \n",
    "    return centroids_\n",
    "\n",
    "def plot_f1s(performance_dict):\n",
    "    ### graph the f1 scores by category:\n",
    "    plt.figure(figsize=(12,7))\n",
    "    plt.xlabel(str('Samples'), fontsize=14, fontweight='bold')\n",
    "    plt.ylabel(str('f1'), fontsize=14, fontweight='bold')\n",
    "    for k in performance_dict.keys():\n",
    "        plt.plot(samples, performance_dict[k], label=k)\n",
    "    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Class (# samples in Test Set)')\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "\n",
    "def create_classification_report(clf, testX, testY, output_dict=True):\n",
    "    predY = clf.predict(testX)\n",
    "    results = classification_report(le.inverse_transform([int(x) for x in testY]), le.inverse_transform([int(x) for x in predY]), output_dict=output_dict)\n",
    "    return results\n",
    "\n",
    "def get_weights(last_perfs, inflation_factor = 1.5):\n",
    "# get weights for guided search based on latest per-category f1 scores, as saved in last_perfs\n",
    "# inflation_faction determines the gap in probability of sampling between categories with good and \n",
    "# bad f1 scores\n",
    "# probabilities are calculates as softmax(f1*inflation_factor)\n",
    "    f1s = []\n",
    "    for k in last_perfs.keys():\n",
    "        if k in [str(x) for x in le.classes_.tolist()]:\n",
    "            #print(k)\n",
    "            f1s.append((1-last_perfs[k][\"f1-score\"])*inflation_factor)\n",
    "    return softmax(f1s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### create the function for the actual running of the methods:\n",
    "\n",
    "def run_annotations(pooldf, filtered_categories, test, STEPSIZE=STEPSIZE, NUM_STEPS=NUM_STEPS):\n",
    "    # lists to store results for each iteration\n",
    "    samples = [] # number of samples\n",
    "    choices = []\n",
    "\n",
    "    # f1 values for each strategy\n",
    "    f1s_random = []\n",
    "    f1s_active = []\n",
    "    f1s_ours_original = []\n",
    "    f1s_ours_closer = []\n",
    "\n",
    "    # pool of unannotated texts for each strategy\n",
    "    pooldf_ours_original = pooldf.copy()\n",
    "    pooldf_ours_closer = pooldf.copy()\n",
    "    pooldf_random = pooldf.copy()\n",
    "    pooldf_active = pooldf.copy()\n",
    "\n",
    "    # core set of annotated texts for each strategy\n",
    "    annotated_ours_original = pooldf_ours_original.groupby(\"y\").sample(20) #  - start with samples from all\n",
    "    annotated_ours_closer = annotated_ours_original.copy()\n",
    "    annotated_random = pooldf_random.sample(annotated_ours_original.shape[0])\n",
    "    if len(set(annotated_random['y'])) == 1:\n",
    "        annotated_random.drop(annotated_random.tail(1).index,inplace=True)\n",
    "        annotated_random = annotated_random.append(pooldf_ours_original.groupby(\"y\").sample(1))\n",
    "    annotated_active = annotated_random.copy()\n",
    "\n",
    "    # remove annotated texts from pool\n",
    "    pooldf_random = pooldf_random.drop(annotated_random.index)\n",
    "    pooldf_active = pooldf_active.drop(annotated_active.index)\n",
    "    pooldf_ours_closer = pooldf_ours_closer.drop(annotated_ours_closer.index)\n",
    "    pooldf_ours_original = pooldf_ours_original.drop(annotated_ours_original.index)\n",
    "    \n",
    "    # main loop - number of iterations\n",
    "    for i in (range(NUM_STEPS)):\n",
    "        ### add in the first step as taking the preliminary sample collection\n",
    "        \n",
    "        if i == 0:\n",
    "            annotated_ours_original = annotated_ours_original\n",
    "            annotated_ours_closer = annotated_ours_closer\n",
    "            annotated_random = annotated_random\n",
    "            annotated_active = annotated_active\n",
    "        \n",
    "        elif i > 0:\n",
    "\n",
    "            ############################ OUR METHOD - ORIGINAL #################################\n",
    "            ### our method - sample from each strata for a category chosen at random\n",
    "\n",
    "            c = calculate_centroids(annotated_ours_original, filtered_categories)\n",
    "            similarities = cosine_similarity(np.array(pooldf_ours_original[\"X\"].to_list()), c)\n",
    "\n",
    "            for ii, cc in enumerate(filtered_categories):\n",
    "                pooldf_ours_original[cc] = [x[ii] for x in similarities]\n",
    "\n",
    "            strata_o = [(0.8, 0.9), (0.7, 0.8), (0.6, 0.7), (0.5, 0.6), (0.4, 0.5), \n",
    "                      (0.3 ,0.4)]\n",
    " \n",
    "\n",
    "            ### randomly choose category\n",
    "            cat_choice = random.choice(filtered_categories)\n",
    "            #calculate empirical strata for category\n",
    "            emp_strata = [np.quantile(pooldf_ours_original[str(cat_choice)], [x, y]) for x, y in strata_o]\n",
    "\n",
    "            new_ours_orig = []\n",
    "\n",
    "            for strata in strata_o:\n",
    "                # select only subset corresponding to strata\n",
    "                temp = pooldf_ours_original[(pooldf_ours_original[str(cat_choice)] > strata[0]) & (pooldf_ours_original[str(cat_choice)] <= strata[1])]\n",
    "\n",
    "                if len(temp) < (STEPSIZE / (len(strata_o))):\n",
    "                    newsamp = temp\n",
    "                    new_ours_orig.append(newsamp)\n",
    "                else:\n",
    "                    newsamp = temp.sample(int(STEPSIZE / (len(strata_o))))\n",
    "                    new_ours_orig.append(newsamp)\n",
    "                # remove it from pool\n",
    "                pooldf_ours_original = pooldf_ours_original.drop(newsamp.index)\n",
    "\n",
    "            new_ours_original = pd.concat(new_ours_orig)\n",
    "\n",
    "            ############################ OUR METHOD - MODIFIED STRATA #################################\n",
    "            ### our method - sample from each strata for a category chosen at random\n",
    "\n",
    "            c = calculate_centroids(annotated_ours_closer, filtered_categories)\n",
    "            similarities = cosine_similarity(np.array(pooldf_ours_closer[\"X\"].to_list()), c)\n",
    "\n",
    "            for ii, cc in enumerate(filtered_categories):\n",
    "                pooldf_ours_closer[cc] = [x[ii] for x in similarities]\n",
    "\n",
    "\n",
    "            ## closer strata for larger corpus\n",
    "            strata_o = [(0.95, 1), (0.9, 0.95), (0.85, 0.9), (0.8, 0.85), (0.7, 0.8), \n",
    "                      (0.6 ,0.7)]   \n",
    "\n",
    "            ### randomly choose category\n",
    "            cat_choice = random.choice(filtered_categories)\n",
    "            #calculate empirical strata for category\n",
    "            emp_strata = [np.quantile(pooldf_ours_closer[str(cat_choice)], [x, y]) for x, y in strata_o]\n",
    "\n",
    "            new_ours_closer = []\n",
    "\n",
    "            for strata in strata_o:\n",
    "                # select only subset corresponding to strata\n",
    "                temp = pooldf_ours_closer[(pooldf_ours_closer[str(cat_choice)] > strata[0]) & (pooldf_ours_closer[str(cat_choice)] <= strata[1])]\n",
    "                ## sample:\n",
    "\n",
    "                if len(temp) < (STEPSIZE / (len(strata_o))):\n",
    "                    newsamp = temp\n",
    "                    new_ours_closer.append(newsamp)\n",
    "                else:\n",
    "                    newsamp = temp.sample(int(STEPSIZE / (len(strata_o))))\n",
    "                    new_ours_closer.append(newsamp)\n",
    "                # remove it from pool\n",
    "                pooldf_ours_closer = pooldf_ours_closer.drop(newsamp.index)\n",
    "\n",
    "            new_ours_closer = pd.concat(new_ours_closer)\n",
    "\n",
    "\n",
    "            ############################ RANDOM SAMPLING #############################################\n",
    "            new_random = pooldf_random.sample(STEPSIZE)\n",
    "            pooldf_random = pooldf_random.drop(new_random.index)\n",
    "\n",
    "            ############################ ACTIVE LEARNING #############################################\n",
    "            clf = train_svm(annotated_active[\"X\"].to_list(), annotated_active[\"y\"].to_list(), proba=True)\n",
    "            pooldf_active[\"metric\"] = entropy(clf.predict_proba(pooldf_active[\"X\"].to_list()).transpose())\n",
    "            pooldf_active = pooldf_active.sort_values(by=\"metric\", ascending=False)\n",
    "            new_active = pooldf_active.iloc[0:STEPSIZE]\n",
    "            pooldf_active = pooldf_active.drop(new_active.index)\n",
    "\n",
    "\n",
    "            ###############################################################################################\n",
    "            ###################################### END SAMPLING STAGE #####################################\n",
    "            ###############################################################################################\n",
    "\n",
    "            # add newly sampled texts to each of the training sets\n",
    "            annotated_ours_original = pd.concat([annotated_ours_original, new_ours_original])\n",
    "            annotated_ours_closer = pd.concat([annotated_ours_closer, new_ours_closer])\n",
    "            annotated_random = pd.concat([annotated_random, new_random])\n",
    "            annotated_active = pd.concat([annotated_active, new_active])\n",
    "\n",
    "        # train models\n",
    "        model_ours_original = train_svm(annotated_ours_original[\"X\"].to_list(), annotated_ours_original[\"y\"].to_list())\n",
    "        model_ours_closer = train_svm(annotated_ours_closer[\"X\"].to_list(), annotated_ours_closer[\"y\"].to_list())\n",
    "        model_random = train_svm(annotated_random[\"X\"].to_list(), annotated_random[\"y\"].to_list())\n",
    "        model_active = train_svm(annotated_active[\"X\"].to_list(), annotated_active[\"y\"].to_list())\n",
    "    \n",
    "        # evaulate and save results\n",
    "        f1s_ours_closer.append(create_classification_report(model_ours_closer, test[\"X\"].to_list(), test[\"y\"]))\n",
    "        f1s_ours_original.append(create_classification_report(model_ours_original, test[\"X\"].to_list(), test[\"y\"]))\n",
    "        f1s_random.append(create_classification_report(model_random, test[\"X\"].to_list(), test[\"y\"]))\n",
    "        f1s_active.append(create_classification_report(model_active, test[\"X\"].to_list(), test[\"y\"]))\n",
    "\n",
    "        # save number of samples\n",
    "        samples.append(annotated_random.shape[0])\n",
    "        \n",
    "        \n",
    "        to_return = [f1s_random, f1s_active, f1s_ours_original, f1s_ours_closer, samples]\n",
    "\n",
    "    return to_return"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### add column with integers for categorical variables:\n",
    "\n",
    "## convert categories to integer labels\n",
    "from sklearn import preprocessing\n",
    "le = preprocessing.LabelEncoder()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_simulation(pooldf, RUNS=RUNS, STEPSIZE=STEPSIZE, NUM_STEPS=NUM_STEPS):\n",
    "    \n",
    "    ## First, create the df\n",
    "    ## Choose random categories from each list\n",
    "    rare_categories = choose_random_categories(potential_rare, 1)\n",
    "    freq_categories = choose_random_categories(potential_freq, 3)\n",
    "    \n",
    "    ## Output as single list\n",
    "    categories_ = rare_categories + freq_categories\n",
    "    \n",
    "    ## Convert the df, including ratios\n",
    "    converted_ = filter_categories_for_pool(rare_categories, freq_categories, pooldf)\n",
    "\n",
    "    ## Assigning numerical values and storing in another column\n",
    "    converted_['y'] = le.fit_transform(converted_['Category'])\n",
    "    \n",
    "    ### Train-Test split:\n",
    "    X_pool, X_test, y_pool, y_test = train_test_split(converted_['X'], converted_['y'], test_size=0.20, random_state=42)\n",
    "    y_pool = [str(x) for x in y_pool]\n",
    "    y_test = [str(x) for x in y_test]\n",
    "    pooldf = pd.DataFrame({\"X\": X_pool.tolist(), \"y\": y_pool})\n",
    "    test = pd.DataFrame({\"X\": X_test.tolist(), \"y\": y_test})\n",
    "    \n",
    "    new_categories = list(set(pooldf['y']))\n",
    "    \n",
    "    ## Run the simulation:\n",
    "    outputs = run_annotations(pooldf, new_categories, test=test)\n",
    "    \n",
    "    return outputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Create dictionaries to hold all information for each run\n",
    "## Each key will be a run and the values will correspond to the f1 scores and samples\n",
    "## for each parallel method of the simulation\n",
    "\n",
    "random_dict = {}\n",
    "active_dict = {}\n",
    "ours_dict_closer = {}\n",
    "ours_original_dict = {}\n",
    "samples_dict = {}\n",
    "\n",
    "\n",
    "for run in tqdm_notebook(range(RUNS)):\n",
    "    outputs = run_simulation(df) # Return the outputs from the simulation's run\n",
    "    random_dict.setdefault(str(run),outputs[0])\n",
    "    active_dict.setdefault(str(run),outputs[1])\n",
    "    ours_original_dict.setdefault(str(run),outputs[2])\n",
    "    ours_dict_closer.setdefault(str(run),outputs[3])\n",
    "    samples_dict.setdefault(str(run),outputs[4])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing the Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Find the mean vector for each dictionary (each dictionary corresponds to an annotation strategy)\n",
    "def extract_macros_mean(results_dict):\n",
    "    vecs = []\n",
    "    for row in results_dict:\n",
    "        vec = []\n",
    "        for i in (results_dict[row]):\n",
    "            vec.append(i['macro avg']['f1-score'])\n",
    "        vecs.append(vec)\n",
    "    mean_ = np.mean([v for v in vecs], axis=0)\n",
    "    return mean_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Visualize the performances\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure(figsize=(12,8))\n",
    "## Note, sample size remains consistent throughout the multiple runs\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(random_dict), label='Random Sampling')\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(active_dict), label='Active Learning')\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(ours_original_dict), label='Our Method')\n",
    "plt.plot(samples_dict['0'], extract_macros_mean(ours_dict_closer), label='Our Method - modified')\n",
    "plt.xlabel(str('# of training samples'), fontsize=14, fontweight='bold')\n",
    "plt.ylabel(str('f1 (macro)'), fontsize=14, fontweight='bold')\n",
    "plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n",
    "plt.setp(plt.gca().get_legend().get_texts(), fontsize='14')\n",
    "plt.tight_layout()\n",
    "\n",
    "### To save the graph:\n",
    "# plt.savefig('Comparing_Annotation_Approaches.png')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (dror ch)",
   "language": "python",
   "name": "dror_ch"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
